Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: correlation support #456

Merged
merged 5 commits into from
May 23, 2024
Merged

feat: correlation support #456

merged 5 commits into from
May 23, 2024

Conversation

huaxingao
Copy link
Contributor

@huaxingao huaxingao commented May 21, 2024

Which issue does this PR close?

Supports CORR
The implementation mostly is the same as the DataFusion's implementation. The reason
we have our own implementation is that DataFusion has UInt64 for state_field count,
while Spark has Double for count. Also adding null_on_divide_by_zero
to be consistent with Spark's implementation.
Closes #.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@huaxingao huaxingao closed this May 21, 2024
@huaxingao huaxingao reopened this May 21, 2024
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 85.71429% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 34.02%. Comparing base (14494d3) to head (fead995).
Report is 26 commits behind head on main.

Files Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 85.71% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #456      +/-   ##
============================================
- Coverage     34.02%   34.02%   -0.01%     
- Complexity      857      859       +2     
============================================
  Files           116      116              
  Lines         38565    38671     +106     
  Branches       8517     8564      +47     
============================================
+ Hits          13120    13156      +36     
- Misses        22691    22753      +62     
- Partials       2754     2762       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

.setCorrelation(corrBuilder)
.build())
} else {
None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add withInfo()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I forgot this. Added.

sql(s"insert into $table values(1, 4, 1), (2, 5, 1), (3, 6, 2)")
val expectedNumOfCometAggregates = 2

sql("SELECT corr(col1, col2) FROM test GROUP BY col3").show
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can reuse s$table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@huaxingao
Copy link
Contributor Author

cc @andygrove @viirya

@@ -1212,6 +1212,157 @@ class CometAggregateSuite extends CometTestBase with AdaptiveSparkPlanHelper {
}
}

test("correlation") {
withSQLConf(CometConf.COMET_EXEC_SHUFFLE_ENABLED.key -> "true") {
Seq(false).foreach { cometColumnShuffleEnabled =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intentional that we only test for cometColumnShuffleEnabled==false? I have similar questions for parquet.enable.dictionary and nullOnDivideByZero, where we also iterate over a single boolean and don't test both for true and false.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be both true and false. I removed one of them so it's easier for me to debug. Forgot to put back.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@viirya viirya merged commit 6d23e28 into apache:main May 23, 2024
40 checks passed
@viirya
Copy link
Member

viirya commented May 23, 2024

Merged. Thanks @huaxingao @andygrove @kazuyukitanimura

@huaxingao
Copy link
Contributor Author

Thanks @viirya @andygrove @kazuyukitanimura

@huaxingao huaxingao deleted the correlation branch May 23, 2024 15:23
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* feat: correlation support

* fmt

* remove un-used import

* address comments

* address comment

---------

Co-authored-by: Huaxin Gao <[email protected]>
(cherry picked from commit 6d23e28)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants